【Hackathon 5th No.2】Add index_fill API to Paddle RFC #621
Conversation
- index (Tensor) – a 1-D tensor containing the indices; supports int32 and int64
- value (float) – the value to fill with; supports int32, int64, float32, float64
- name (str) – for usage see [Name](https://www.paddlepaddle.org.cn/documentation/docs/zh/api_guides/low_level/program.html#api-guide-name); usually not needed, defaults to None.
- As a Tensor, reads and writes of `x` should in principle be independent of dtype, so all dtypes should be supported unless there is a special reason. Since this relies on other APIs, please point it out explicitly if a restriction is caused by one of those APIs.
- For the default behavior of the `axis` parameter, please first add the behavior of competing frameworks for comparison. If the design follows other similar Paddle APIs, this can be covered in the "Current status" section.
- `value` is the same as `x`: in principle all dtypes should be supported. Also consider whether a 0-d tensor should be supported.
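For reference, the intended semantics of the proposed `index_fill` can be sketched as a small NumPy reference implementation (hypothetical illustration, not the proposed kernel; dtype of `x` is irrelevant here, which matches the point above):

```python
import numpy as np

def index_fill_ref(x, axis, index, value):
    # Fill the slices of `x` selected by `index` along `axis` with `value`.
    out = x.copy()
    sel = [slice(None)] * x.ndim   # full slice on every axis ...
    sel[axis] = np.asarray(index)  # ... except the filled one
    out[tuple(sel)] = value
    return out

x = np.arange(12, dtype=np.float64).reshape(3, 4)
print(index_fill_ref(x, 0, [0, 2], -1.0))  # rows 0 and 2 become -1
```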
Done
`return paddle.reshape(out, x_dim_vec)`

The index traversal follows the CPU implementation of the cummax/cummin operators ([link](https://github.com/PaddlePaddle/Paddle/pull/53546/files#diff-0417a927e0148c22ecb722f950e2f9704d6e899e9899521f0a269b173ceb2de2)).
Overall, since this code runs a two-level loop on the Python side, could it have performance problems? `index_put` indexes in axis order; could you compare against a transpose + index_put approach to get a first estimate of the performance difference between the two?
My understanding of the transpose + index_put approach: transpose moves the specified axis to the front, which gathers the elements that need to be overwritten; the subscripts are then extracted and passed to `index_put`, and the result is transposed back to the original shape. The code below implements part of that logic:
```python
import numpy as np
import paddle
import torch

arr = np.random.random((4, 3, 2)).astype('float64')
pd_arr = paddle.to_tensor(arr)
tor_arr = torch.tensor(arr)
index = [0, 2]
# axis = 0: the transpose is the identity permutation
print(paddle.transpose(pd_arr, perm=[0, 1, 2]))
print(torch.transpose(torch.index_fill(tor_arr, 0, torch.tensor(index), -1), 0, 0))
# axis = 1: move the filled axis to the front, fill, then transpose back
print(paddle.transpose(pd_arr, perm=[1, 0, 2]))
print(torch.transpose(torch.index_fill(tor_arr, 1, torch.tensor(index), -1), 0, 1))
```
A rough estimate of the performance difference: let the tensor hold N elements with rank R and shape ndim, let the index contain L elements, assume the shape-changing ops reshape, flatten, and transpose are all O(N), and assume the `index_put` cost is the same O(P) in both cases.

Current approach: one flatten, one reshape, and L index scans of N/ndim[axis] = S elements each, for a total of 2·O(N) + L·O(N/ndim[axis]) + O(P).

transpose + index_put: two transposes plus constructing R per-axis index arrays, each taking L·O(N/ndim[axis]), for a total of 2·O(N) + (R−1+L)·L·O(N/ndim[axis]) + O(P).

Because the indices still have to be constructed, the Python-side loops are not reduced; there are actually more of them. My understanding is that looping over the N/ndim[axis]·L index positions (to build the index array) is hard to avoid; flattening reduces the cost of building that array from R passes to one, which is probably better. But if transpose itself is implemented more efficiently than reshape/flatten, then transpose + index_put might come out ahead.
1. The complexity analysis is fine, but the same operation runs at different speeds in Python vs C++ and on CPU vs GPU, so it is best to compare actual timing numbers.
2. transpose currently uses the stride mechanism by default and returns a view of the original Tensor; in-place modification of the view shows up directly in the original Tensor, so in theory the inplace version does not need to transpose back.

Could you put together a quick implementation and compare timings on a few small cases?
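The view point above can be checked with a small NumPy sketch (NumPy as a stand-in for the Paddle ops; `np.moveaxis` returns a view much like Paddle's strided transpose, so assigning through it needs no transpose back):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((4, 3, 2))
axis, index, value = 1, np.array([0, 2]), -1.0

# Approach A: fancy-index along `axis` directly.
a = x.copy()
sel = [slice(None)] * a.ndim
sel[axis] = index
a[tuple(sel)] = value

# Approach B: move `axis` to the front; np.moveaxis returns a view,
# so the assignment mutates `b` in place, no transpose back needed.
b = x.copy()
np.moveaxis(b, axis, 0)[index] = value

assert np.array_equal(a, b)
print("view assignment matches direct fancy indexing")
```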
- value (float) – the value to fill with

Given the `dim` and `index` used to locate positions, the corresponding entries of the tensor are modified in place to `value`.
Please flesh out the details of this API: e.g. the supported dtypes; whether `index` has rank requirements; whether `value` supports 0-d tensors, complex types, and so on.
Done
## Naming and parameter design

The API is designed as `paddle.index_fill(x, axis, index, value, name)` and `paddle.index_fill_(x, axis, index, value, name)`.
For consistency with similar APIs such as index_add / index_select, it is recommended to put `index` before `axis`.
A quick test of the two approaches above with an input of size [400, 300, 20], run ten times and timed: flatten + index_put and transpose + index_put both take about 2s. I then tried assigning directly on the tensor dimensions after the transpose; that takes only about 0.05s in the test, an improvement of almost two orders of magnitude, and the code is also simpler. Is this approach acceptable? I just pushed a version: PaddlePaddle/Paddle#57416
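A minimal timing harness for this kind of comparison, sketched in NumPy (a stand-in only; the real numbers should of course be measured on Paddle tensors), contrasting flatten + scatter-by-flat-index against direct assignment through a transposed view:

```python
import time
import numpy as np

def bench(fn, reps=10):
    t0 = time.perf_counter()
    for _ in range(reps):
        fn()
    return time.perf_counter() - t0

x = np.random.random((400, 300, 20))
axis, index, value = 1, np.array([0, 2, 5]), -1.0

def flat_scatter():
    # flatten, scatter into computed flat positions, reshape back
    out = x.copy().reshape(-1)
    pos = np.moveaxis(np.arange(x.size).reshape(x.shape), axis, 0)[index].reshape(-1)
    out[pos] = value
    return out.reshape(x.shape)

def view_assign():
    # assign directly through a transposed view; mutates `out` in place
    out = x.copy()
    np.moveaxis(out, axis, 0)[index] = value
    return out

assert np.array_equal(flat_scatter(), view_assign())
print(f"flat scatter: {bench(flat_scatter):.4f}s  view assign: {bench(view_assign):.4f}s")
```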
@Patrick-Star125 Just go with the faster approach. Please also update the solution section of the RFC accordingly; after it is merged we will review the development PR.
Updated.
```python
if in_dynamic_mode():
    out[index] = value
else:
    out = paddle.static.setitem(out, index, value)
```
Here it is recommended to call index_put or index_put_ directly; index parsing and dispatch to the concrete OP involve extra logic that adds overhead.
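The distinction can be illustrated in NumPy terms (a rough analogue only, not Paddle internals): `out[index] = value` goes through generic `__setitem__` index parsing and classification, while a direct scatter routine takes index tensors straight through; here `np.put_along_axis` is used as a hypothetical stand-in for `index_put_`:

```python
import numpy as np

x = np.zeros((4, 3))
index = np.array([0, 2])

# Generic __setitem__: the index expression is parsed/classified first.
a = x.copy()
a[index] = -1.0

# Direct scatter routine: index tensors are passed straight through,
# skipping the parsing step (np.put_along_axis as a rough stand-in
# for paddle.index_put_).
b = x.copy()
np.put_along_axis(b, np.broadcast_to(index[:, None], (2, 3)), -1.0, axis=0)

assert np.array_equal(a, b)
print("direct scatter matches __setitem__")
```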
Done
LGTM
Add index_fill API design document